Is there statistical evidence of fewer births on Feb 29 in the US?
An evaluation using 2000-2014 data
Leap day aversion? A 2012 calandar of daily births.
Is there statistical evidence that Feb 29 is an un-favored day to give birth in the US?
The 2012 calendar below, showing number of daily birth as represented by color and bubble size, shows quite a lot of variation in the number of births from one day to another. For example December 25th’s recorded number of births is small compared with its neighbors, and births on the weekends appear to be much lower than weekday counterparts.
With February 29th (leap day) coming up this week, you might wonder, are there also relatively fewer births on February 29th, circled on the calendar. Feb 29th is a weird birthday to have – only truly ‘celebrated’ every four years. But it’s hard to draw any conclusions about a dip in births being more than normal variation in daily births from this cursory look and from a single year of data. In what follows, we’ll assess the question, ‘Is there statistical evidence of preference against February 29th birthing based on number of daily births in the U.S. over a 15 year period, 2000 to 2014?’
I use ggcalendar functions to build this plot in a concise way.
library(ggcalendar)
births_df %>%
filter(year == 2012) %>%
ggcalendar() +
aes(date = date) +
geom_point_calendar() +
aes(size = births) +
aes(color = births) +
geom_text_calendar(aes(label = day(date)),
color = "oldlace",
size = 2) +
guides(
colour = guide_legend("Births"),
size = guide_legend("Births")
) +
geom_point_calendar(data = data.frame(date =
as_date("2012-02-29")),
size = 7, color = "red", shape = 21, stroke = 2) +
scale_color_viridis_c() +
labs(title = "The year in 2012 in births")Zooming in on the week of February 29, 2012 in the figure below, it’s easier to appreciate the dip in number of births that is observed on February 29th. But the question remains, Is this variability usual, or is there an aversion to birthing on February 29th that is statistically detectable looking at more leap days?
births_df |>
filter(date >= as.Date("2012-02-26") &
date <= as.Date("2012-03-03")) |>
ggplot() +
aes(x = date, y = births) +
geom_point() +
geom_line(linetype = "dashed") +
geom_label(aes(label = wday(date, label = T))) +
geom_point(data = . %>% filter(date == as.Date("2012-02-29")),
color = "red", shape = 1,
size = 20, stroke = 2, alpha = .8) +
labs(title = "Births the week of Feb 29, 2012") I begin the analysis with a visual format that’s familiar to the audience, and where we can be pretty sure of the effects of relevant variables just by inspection
2000-2014 U.S. Births Data
The 15 years of data we’ll look at was available via the #TidyTuesday project – their October 2, 2018 dataset. Some data cleaning is required including constructing a full date variable from year, month and date_of_month variables. I also normalize the dates (to 2020) for superimposed within-year comparisons - importantly 2020 is a leap year so Feb 29th dates are valid and not dropped. I further include indicator variables that I anticipate to be important as well as an indicator variable that captures the condition of interest, ind_Feb_29th.
library(tidyverse)
births_path <- "https://raw.githubusercontent.com/EvaMaeRey/tableau/9e91c2b5ee803bfef10d35646cf4ce6675b92b55/tidytuesday_data/2018-10-02-us_births_2000-2014.csv"
readr::read_csv(births_path) %>%
filter(year != 2015) %>%
mutate(month = str_pad(month, 2, pad = "0"),
date_of_month = str_pad(date_of_month, 2, pad = "0")) %>%
mutate(date = paste(year, month, date_of_month, sep = "-") %>% as_date()) %>%
mutate(ind_holiday =
(month == "12" & date_of_month %in% 24:31) |
(month == "07" & date_of_month %in% c("04", "05")) |
(month == "01" & date_of_month %in% c("01", "02")) |
(month == "10" & date_of_month == "31") |
(month == "11" & date_of_month %in% 20:30) |
(month == "02" & date_of_month %in% 14)
) |>
mutate(date_as_if_in_2020 = paste(2020, month, date_of_month, sep = "-") %>% as_date()) |>
mutate(ind_weekend = wday(date) == 1 | wday(date) == 7) |>
mutate(ind_Feb_29th = month(date) == 2 & day(date) == 29) |>
mutate(ind_13th = day(date) == 13) |>
mutate(ind_Fri13th = wday(date) == 6 & day(date) == 13) ->
births_dfVisual exploration
Baseline visualization of the time series
Visualizing the U.S. births time series, we see that there is tremendous variability in number of birth per day. The standard deviation for daily births is 2326 in this time span, with the average number of births around 11400, as marked in the visualization below.
I use ggxmean functions to show the mean value for daily births.
births_df |>
ggplot() +
aes(x = date, y = births) +
geom_point(size = .2) ->
time_series_base
add_y_mean <- function(){
list(ggxmean::geom_y_mean(linetype = "dotted"),
ggxmean::geom_y_mean_label() )
}
time_series_base +
add_y_mean()Univariate distribution
Looking at the time series above or the univariate distribution of daily births below, the bi-modal pattern is striking.
*I use function geom_x_mean()
# ggchalkboard:::geoms_chalk_on()
ggplot(births_df) +
aes(x = births) +
geom_histogram() +
ggxmean::geom_x_mean() +
ggxmean::geom_x_mean_label() +
geom_rug(alpha = .2) +
labs(x = "Number of births") +
# ggchalkboard::theme_chalkboard() +
labs(title = "Distribution of Number of Births in the U.S. each day from 2000-2014" %>% str_wrap(45)) ->
univariate_base; univariate_baseExploring bi-modality with weekend indicator
Breaking this data up, we explore the hypothesis that preference for scheduled birth (inducements or c-sections) is for weekdays. In the figure below see that most of the bimodality is explained by weekend v. weekday effects. The difference in means for these groups is substantial - around 4650 difference in number of births!
I use the ind_recode function from the ind2cat package which automatically recodes the indicator variable to meaningful categories. If the T/F indicator variable is used directly, in this plot facet labels would be TRUE and FALSE, and would be hard to interpret without looking a the plot source code.
Day of weeks effects
When we return to our time-series plot, but breaking the plots up by weekend, you may notice further bimodality within the ‘weekend’ subplot suggesting differences within the weekend for Saturday and Sunday.
time_series_base +
facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) +
labs(title = "We explore the bimodal distribution looking at 'weekend effects'" %>% str_wrap(45))Following on that observation, we investigate by-weekday (Sunday, Monday, Tuesday etc) differences in number of births below. We look first at the histogram and then the time series plot.
We add a confidence interval around the mean based on a t-test.
time_series_base +
facet_wrap(~wday(date, label = T,
week_start = "Monday"), ncol = 2) +
labs(title = "Number of births 2000-2014 by day of week" %>% str_wrap(45))Holidays and outlying observations
The by-day-of-the-week exploration (and weekend vs. weekday) further exposes some outlying observations. We next explore if these lower-than-usual their counterparts are by and large holiday or holiday adjacent days. Due to time constraints, I’m relying a quick indicator variable that I created that include many holiday and adjacent dates.
last_plot() -> tplot
tplot[[2]] <- NULL
tplot +
geom_point(alpha = .5, size = .5) +
aes(color = ind2cat::ind_recode(ind_holiday)) +
labs(color = NULL) +
theme(legend.position = 'top', legend.justification = "left")This exploration suggests that holidays drive much of the out-of-character observations.
Outlyingness might also be due to aversion due to superstition - for example the 13th of each month, especially Fridays the 13th, and Halloween might be more outlying though not holidays.
Also, holiday adjacency might also lead to lower number of births - for example federal holidays that fall on the weekend are often observed on a Monday. The outliers that we observe on Monday are not categorized as holiday, but this likely due to the imperfect ind_holiday coding. President’s day was not captured in my coding, for example and always falls on a Monday.
Because February 29 is not a holiday or holiday-adjacent, it is appropriate to prune our data to relevant comparisons and try to remove likely holidays; this should lead to more precise estimates in our final analysis.
Due to time constraints and convenience, our pruning will be based on outlyingness instead of relying on purely substantive knowledge (i.e. using lists of federal and celebrated holidays, etc).
We’ll prune observations for which a linear model’s error is greater than 3 standard deviations from the mean residual error (which will be zero).
The linear model contain date and day of the week as categories that explain number of daily births.
To start, I visualize the parallel lines model fit.
This visualization uses ggplot2 Stat extension via the new and experimental ggtemp package which allows creating new geom_ and stat_ functions with less ‘boiler plate’ code. A more syntactically conventional construction of this layer function is featured in the ggplot2 extension cookbook
compute_panel_lm_parallel <- function(data, scales){
model <- lm(y ~ x + I(x^2) + I(x^3)+ I(x^4) + category, data = data)
data |>
mutate(y = model$fitted)
}
ggtemp:::create_layer_temp(fun_name = "geom_ols_linear_parallel",
required_aes = c("x", "y", "category"),
compute_panel = compute_panel_lm_parallel,
default_aes = aes(color = after_stat(category)),
geom_default = GeomLine)
ggplot(births_df) +
aes(x = date, y = births,
color = wday(date, label = T), category = wday(date, label = T)) +
geom_point(alpha = .15) +
geom_ols_linear_parallel()Now, we estimate model outside of the visualization tool. Then we visualize the distribution of the model residuals and mark the threshold of 3 standard deviations from the mean, beyond which we’ll prune observations.
For convenience, we again use some functions from ggxmean
births_df$x <- as.numeric(births_df$date)
m <- lm(births ~ x + I(x^2) + I(x^3)+ I(x^4) + wday(date, label = T), data = births_df)
births_df$residuals_wday <- m$residuals
ggplot(births_df) +
aes(x = residuals_wday) +
geom_histogram() +
ggxmean::geom_x_mean() +
ggxmean:::geom_x3sd(linetype = "dashed")Dropping the large residuals results in 116 days dropped from our dataset.
births_df_pruned <- births_df |>
filter(abs(residuals_wday) <
3*sd(births_df$residuals_wday))
time_series_base %+% births_df_pruned +
facet_wrap(~wday(date, label = T), ncol = 2) +
labs(title = "Outlier pruned" %>% str_wrap(45)) In the remainder of the exploration and analysis, we’ll use the pruned version of the data, unless otherwise specified.
Seasonality and longer term trends
To look at the within-year periodicity, we use a normalized date variable, ‘date_as_if_in_2020’. We see summer and fall surges in number of births.
ggplot(births_df_pruned) +
aes(x = date_as_if_in_2020,
y = births) +
geom_point() +
aes(color = factor(year)) +
labs(color = NULL) +
facet_wrap(~wday(date, label = T)) +
facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 2) +
geom_smooth() +
labs(title = "Using loess smoothing within year and weekend v. weekday" ) +
scale_x_date(labels = c("Jan", "April", "July", "Oct", "Jan")) +
labs(x = NULL)In the plot above, we also see differences in the rate of births by year. Exploring that more directly, we look at the distributions of number of daily birth by year (this is done with the unpruned data). Diamonds mark the mean for each year.
ggplot(births_df) +
aes(x = births, y = factor(year)) +
# geom_histogram() +
ggridges::geom_density_ridges() +
stat_summary(fun.y = mean, geom = "point", col = "goldenrod3",
size = 10, shape = "diamond") +
labs(x = "Daily births")As a more general point of interest, here we summarize to the total number of births for each year in this period in the United States. Then number of annual births hovers around 4 million, reaching a height of almost 4.4 million births in 2007.
ggplot(births_df) +
aes(x = year, weight = births/1000000) +
ggchalkboard:::theme_chalkboard() +
geom_bar(fill = "lightyellow", alpha = .75) +
stat_count(geom = "text",
linetype = "dashed",
aes(label = after_stat(round(count, 2))),
vjust = 1.4, size = 3,
color = "darkseagreen4") +
labs(y = "Number of births (millions)") +
labs(title = "Number of births in the U.S. (millions)") +
scale_x_continuous(breaks = 2000:2014, labels = c("2000", "", "", "","'04", "", "", "","'08", "", "", "","'12", "", "'14"))Multiple linear regression modeling
Given the relationships uncovered in the exploratory visualizations, factors that should be included in any model of the number of daily births in the US should include seasonality, long term trends and day-of-week effects. We also model from our pruned data, dropping likely holidays as we are not interested in explaining the variation related to holidays given the 29th’s non-holiday status.
We account for long term trends by using indicator variables for each year; for seasonality using month indicator variables for each month; and for day-of-week effects using an indicator variable for each day of the week. The variable of interest, ‘being February 29th’ is also an indicator variable.
births_base_model <- lm(formula =
births ~ factor(year) +
month +
factor(day_of_week) +
ind_Feb_29th,
data = births_df_pruned) Based on model diagnostic plots, the inclusion of some observations do deserve additional review.
births_base_model_feb29 <- lm(formula =
births ~ factor(year) + month + factor(day_of_week) + ind_Feb_29th,
data = births_df_pruned)
summary(births_base_model_feb29)
#>
#> Call:
#> lm(formula = births ~ factor(year) + month + factor(day_of_week) +
#> ind_Feb_29th, data = births_df_pruned)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2672.19 -224.51 8.16 226.67 2079.45
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11714.90 31.87 367.591 < 2e-16 ***
#> factor(year)2001 -77.34 30.84 -2.508 0.012180 *
#> factor(year)2002 -111.49 30.84 -3.615 0.000303 ***
#> factor(year)2003 65.26 30.84 2.116 0.034401 *
#> factor(year)2004 97.44 30.84 3.160 0.001587 **
#> factor(year)2005 187.56 30.80 6.090 1.21e-09 ***
#> factor(year)2006 546.64 30.82 17.737 < 2e-16 ***
#> factor(year)2007 671.91 30.86 21.771 < 2e-16 ***
#> factor(year)2008 446.51 30.81 14.490 < 2e-16 ***
#> factor(year)2009 132.71 30.84 4.303 1.72e-05 ***
#> factor(year)2010 -233.41 30.84 -7.568 4.44e-14 ***
#> factor(year)2011 -377.31 30.80 -12.251 < 2e-16 ***
#> factor(year)2012 -397.76 30.81 -12.909 < 2e-16 ***
#> factor(year)2013 -457.71 30.84 -14.841 < 2e-16 ***
#> factor(year)2014 -359.70 30.84 -11.663 < 2e-16 ***
#> month02 137.81 27.95 4.931 8.42e-07 ***
#> month03 120.56 27.23 4.427 9.76e-06 ***
#> month04 16.47 27.46 0.600 0.548769
#> month05 271.54 27.46 9.888 < 2e-16 ***
#> month06 476.18 27.46 17.342 < 2e-16 ***
#> month07 831.78 27.44 30.310 < 2e-16 ***
#> month08 889.65 27.23 32.666 < 2e-16 ***
#> month09 1155.29 27.71 41.687 < 2e-16 ***
#> month10 371.83 27.24 13.653 < 2e-16 ***
#> month11 422.16 27.95 15.103 < 2e-16 ***
#> month12 458.54 27.68 16.566 < 2e-16 ***
#> factor(day_of_week)2 1039.11 21.25 48.907 < 2e-16 ***
#> factor(day_of_week)3 811.87 21.25 38.213 < 2e-16 ***
#> factor(day_of_week)4 852.05 21.34 39.931 < 2e-16 ***
#> factor(day_of_week)5 554.31 21.35 25.959 < 2e-16 ***
#> factor(day_of_week)6 -3588.79 21.18 -169.456 < 2e-16 ***
#> factor(day_of_week)7 -4635.95 21.17 -218.957 < 2e-16 ***
#> ind_Feb_29thTRUE -864.33 207.52 -4.165 3.16e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 412.5 on 5330 degrees of freedom
#> Multiple R-squared: 0.9679, Adjusted R-squared: 0.9677
#> F-statistic: 5027 on 32 and 5330 DF, p-value: < 2.2e-16Narrow to calendar peers
# ggchalkboard:::geoms_chalk_off()
births_df |>
filter(date_as_if_in_2020 >= as.Date("2020-02-20")) |>
filter(date_as_if_in_2020 <= as.Date("2020-03-13")) |>
ggplot() +
aes(x = date_as_if_in_2020, y = births) +
geom_line( alpha= .2) +
geom_point(aes(shape = ind2cat::ind_recode(ind_weekend),
fill = ind2cat::ind_recode(ind_Feb_29th),
size = ind2cat::ind_recode(ind_Feb_29th)),
shape = 21) +
geom_text(aes(label = wday(date, label = T)),
vjust = -0.52, size = 3) +
# geom_text(aes(label = births),
# vjust = 1.2) +
facet_wrap(~year) +
scale_size_discrete(range = c(4,6))
last_plot() +
facet_null() +
aes(group = paste(year, ind_weekend, isoweek(date))) +
aes(color = year)births_model_feb29_peers <- lm(formula =
births ~ factor(year) +
date_as_if_in_2020 +
factor(day_of_week) +
ind_Feb_29th,
data = births_df_pruned |>
filter(date_as_if_in_2020 >= as.Date("2020-02-16")) |>
filter(date_as_if_in_2020 <= as.Date("2020-03-9")))
summary(births_model_feb29_peers)
#>
#> Call:
#> lm(formula = births ~ factor(year) + date_as_if_in_2020 + factor(day_of_week) +
#> ind_Feb_29th, data = filter(filter(births_df_pruned, date_as_if_in_2020 >=
#> as.Date("2020-02-16")), date_as_if_in_2020 <= as.Date("2020-03-9")))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -997.5 -171.2 -9.9 178.3 1042.1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -21789.586 42610.810 -0.511 0.609
#> factor(year)2001 -99.521 85.558 -1.163 0.246
#> factor(year)2002 -48.996 85.548 -0.573 0.567
#> factor(year)2003 113.185 85.558 1.323 0.187
#> factor(year)2004 138.110 84.382 1.637 0.103
#> factor(year)2005 96.470 85.519 1.128 0.260
#> factor(year)2006 477.589 85.508 5.585 5.09e-08 ***
#> factor(year)2007 734.570 85.558 8.586 4.40e-16 ***
#> factor(year)2008 559.862 84.381 6.635 1.44e-10 ***
#> factor(year)2009 384.811 85.548 4.498 9.69e-06 ***
#> factor(year)2010 -128.954 85.559 -1.507 0.133
#> factor(year)2011 -396.802 85.519 -4.640 5.14e-06 ***
#> factor(year)2012 -495.252 84.343 -5.872 1.11e-08 ***
#> factor(year)2013 -566.769 85.548 -6.625 1.53e-10 ***
#> factor(year)2014 -405.133 85.558 -4.735 3.33e-06 ***
#> date_as_if_in_2020 1.823 2.326 0.784 0.434
#> factor(day_of_week)2 1112.832 59.098 18.830 < 2e-16 ***
#> factor(day_of_week)3 988.488 58.821 16.805 < 2e-16 ***
#> factor(day_of_week)4 1034.869 58.743 17.617 < 2e-16 ***
#> factor(day_of_week)5 841.286 58.827 14.301 < 2e-16 ***
#> factor(day_of_week)6 -3204.244 58.736 -54.554 < 2e-16 ***
#> factor(day_of_week)7 -4258.234 58.818 -72.397 < 2e-16 ***
#> ind_Feb_29thTRUE -867.752 146.946 -5.905 9.22e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 285.9 on 311 degrees of freedom
#> Multiple R-squared: 0.9835, Adjusted R-squared: 0.9824
#> F-statistic: 844 on 22 and 311 DF, p-value: < 2.2e-16
ggplot(fortify(births_model_feb29_peers)) +
aes(x = .fitted, y = .resid) +
geom_point(alpha = .2) +
geom_hline(yintercept = 0, color = "darkred") +
geom_smooth()